
Network churn Load test - Add network policy enforcement latency measurement #431

Closed
agrawaliti wants to merge 52 commits into main from network-churn

Conversation

agrawaliti commented Dec 12, 2024

Integrate network policy enforcement latency measurement.

Developed pipelines to compare network policy-related metrics between Azure CNI powered by Cilium and Azure CNI Overlay using Network Policy Manager.
All configuration, such as the number of nodes, pods, namespaces, and policies per namespace, can be updated in the pipeline.

Pipeline: https://dev.azure.com/akstelescope/telescope/_build?definitionId=41

Dashboard with new metrics: https://dataexplorer.azure.com/dashboards/e033bb3b-2cf4-4263-b41b-31597a8c4401?p-_startTime=24hours&p-_endTime=now&p-_cluster=v-cilium_network_churn_main&p-_test-type=v-default-config#5117e0aa-eb12-4f7f-b55d-6ffba1eab4ad

agrawaliti marked this pull request as ready for review on December 30, 2024.
agrawaliti changed the title from "Network churn" to "Network churn Load test - Add network policy enforcement latency measurement" on Dec 30, 2024.
Author

@microsoft-github-policy-service agree company="Microsoft"

Comment thread jobs/competitive-test.yml
- name: run_id
type: string
default: ''
- name: run_id_2
Collaborator

What is run_id_2 for?

Author
agrawaliti commented Jan 9, 2025

I am using two different pre-created clusters for azure_cilium and azure_cni_overlay, and I am passing them via run_id and run_id_2. Creating two new clusters with 1000 nodes each for every run takes a very long time, so I pass the two cluster tags and run the tests against the existing clusters.

On second thought, I think I can do this with Terraform and schedule it to run periodically.
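
For reference, a minimal sketch of how the two cluster tags are surfaced as pipeline parameters (the parameter names appear in the diff; the defaults and the use_secondary_cluster wiring are assumptions):

```yaml
parameters:
  # Tag of the pre-created azure_cilium cluster
  - name: run_id
    type: string
    default: ''
  # Tag of the pre-created azure_cni_overlay cluster
  - name: run_id_2
    type: string
    default: ''
  # When true, test steps target the cluster identified by run_id_2
  - name: use_secondary_cluster
    type: boolean
    default: false
```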

Comment thread jobs/competitive-test.yml
- name: ssh_key_enabled
type: boolean
default: true
- name: use_secondary_cluster
Collaborator

What is the secondary cluster for?


variables:
SCENARIO_TYPE: perf-eval
SCENARIO_NAME: cilium-network-churn
Collaborator

Suggest naming this network-policy-churn.

parameters:
role: net
region: ${{ parameters.regions[0] }}
- template: /steps/engine/clusterloader2/cilium/scale-cluster.yml
Collaborator

Can we do this in Terraform when setting up the cluster?

Comment thread steps/setup-tests.yml
- name: run_id
type: string
default: ''
- name: run_id_2
Collaborator

same here

Comment thread steps/setup-tests.yml
type: string
- name: ssh_key_enabled
type: boolean
- name: use_secondary_cluster
Collaborator

same here

Contributor

jshr-w left a comment

Let's try to (1) minimize changes that touch code that other pipelines use, and (2) after minimizing those changes, run the other automated pipelines off this branch to ensure they aren't broken.

# Service test
{{$BIG_GROUP_SIZE := DefaultParam .BIG_GROUP_SIZE 4000}}
{{$SMALL_GROUP_SIZE := DefaultParam .SMALL_GROUP_SIZE 20}}
{{$SMALL_GROUP_SIZE := DefaultParam .CL2_DEPLOYMENT_SIZE 20}}
Contributor

Can we name this CL2_SMALL_GROUP_SIZE to keep the variable naming consistent?
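
That is, a one-line sketch of the suggested rename (assuming the same CL2_-prefixed override convention as the surrounding parameters):

```yaml
{{$SMALL_GROUP_SIZE := DefaultParam .CL2_SMALL_GROUP_SIZE 20}}
```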

{{$SMALL_GROUP_SIZE := DefaultParam .CL2_DEPLOYMENT_SIZE 20}}
{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}
{{$smallDeploymentPods := DivideInt $totalPods $namespaces}}
Contributor

This is going to break all the other tests, right? Could you please restore this, and instead create a parameter for bigDeployments and set it to 0?
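
A sketch of the suggested shape, keeping the original computation and driving big deployments through the existing bigDeploymentsPerNamespace parameter (overriding it to 0 for this pipeline; the exact override mechanism is an assumption):

```yaml
# Keep the original pod math; the network-churn pipeline can pass
# bigDeploymentsPerNamespace=0 instead of editing the shared template.
{{$bigDeploymentsPerNamespace := DefaultParam .bigDeploymentsPerNamespace 1}}
{{$smallDeploymentPods := SubtractInt $podsPerNamespace (MultiplyInt $bigDeploymentsPerNamespace $BIG_GROUP_SIZE)}}
```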

deleteStaleNamespaces: true
deleteAutomanagedNamespaces: true
enableExistingNamespaces: false
enableExistingNamespaces: true
Contributor

This may break testing. If namespaces weren't deleted by the previous run, we should be aware of it (many existing pipelines depend on this). Let's restore the original value.

objectTemplatePath: deployment_template.yaml
templateFillMap:
Replicas: {{$bigDeploymentSize}}
Replicas: {{$bigDeploymentSize}}kube
Contributor

Is this a typo?

Group: {{.Group}}
deploymentLabel: {{.deploymentLabel}}
{{end}}
- namespaceRange:
Contributor

We don't want this code to execute if we are not running a network policy test, right? Shouldn't we 'else'-gate this?

Author

I don't want to run bigDeployment for the network test, so I have added {{if not $NETWORK_TEST}} around the big deployment.

Contributor

I mean, it works for your scenario, but it will break the others: if NETWORK_TEST is not set, what is stopping the pipeline from running both phases?
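
In other words, an explicit either/or gate over the two phases, sketched here with the phase bodies elided:

```yaml
{{if $NETWORK_TEST}}
  # network policy phases only
{{else}}
  # original big/small deployment phases only
{{end}}
```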

def calculate_config(cpu_per_node, node_count, provider, service_test):
def calculate_config(cpu_per_node, node_per_step, pods_per_node, provider):
throughput = 100
nodes_per_namespace = min(node_count, DEFAULT_NODES_PER_NAMESPACE)
Contributor

I think the changes made to pods_per_node are going to break many of the existing pipelines. After this change, how can the service test use 20 pods per node? We need to be careful adding parameters given the number of pipelines that depend on them... IMO the safest way is to have an IF branch here, and possibly a pods-per-node parameter ONLY for the network_test.

Author

In my opinion, having pods_per_node as a hard-coded constant that can change frequently based on the use case is not a good approach. I have added that parameter to the pipeline configuration, so if we need a custom value we can set it in the pipeline; if unset, it will keep working as before with the default of 40.

Contributor

The default isn't the same for all pipelines, though, so something would break.

For consistency, let's instead take the same approach as #456 and use the max_pods param to configure pods_per_node for this pipeline.
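
A sketch of what that could look like in the pipeline's engine input, reusing the existing max_pods input instead of introducing a new pods_per_node argument (the exact wiring in #456 may differ):

```yaml
# Drive pods-per-node through the existing max_pods input (as in #456)
# rather than adding a new pods_per_node argument to slo.py.
max_pods: ${{ parameters.pods_per_node }}
```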

"measurement": None,
"result": None,
# "test_details": details,
# # "test_details": details,
Contributor

nit: typo

parser_configure.add_argument("pods_per_node", type=int, help="Number of pods per node")
parser_configure.add_argument("repeats", type=int, help="Number of times to repeat the deployment churn")
parser_configure.add_argument("operation_timeout", type=str, help="Timeout before failing the scale up test")
parser_configure.add_argument("no_of_namespaces", type=int, default=1, help="Number of namespaces to create")
Contributor

Adding positional arguments without a forced default (you need to set nargs) will probably break all the other pipelines, in my understanding... this comment applies to all of the added arguments.

az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --node-taints "slo=true:NoSchedule" --labels slo=true
sleep 300
az aks nodepool update --cluster-name $aks_name --name $np --resource-group $aks_rg --labels slo=true test-np=net-policy-client
# sleep 300
Contributor

This is going to affect all the other pipelines... please let's be careful!

Author

Hello, I pushed some test commits yesterday; I am cleaning them up. Thanks for pointing this out.

@@ -2,6 +2,7 @@ name: load-config

# Config options for test type
{{$SERVICE_TEST := DefaultParam .CL2_SERVICE_TEST true}}
Contributor

This value is not explicitly overridden in slo.py, so you are going to create services in this test even though you set it to false in cilium-network-churn.yml. This has been fixed in main; please rebase. We do not want services created in this test.

node_per_step: ${{ parameters.node_per_step }}
max_pods: 100
pods_per_node: ${{ parameters.pods_per_node }}
repeats: ${{ parameters.repeats }}
Contributor

We need to run this test with 10 repeats for data collection.

- eastus2
engine: clusterloader2
engine_input:
image: "ghcr.io/agrawaliti/clusterloader2:latest"
Contributor

I do not understand why we are using this image. We should be using the latest image from Azure/perf-tests. Has a change been made to the perf-tests fork? If so, it needs to be reviewed so we can get an updated image. Otherwise, we should use the latest clusterloader2 image like the other pipelines.

agrawaliti closed this on Apr 17, 2025.
agrawaliti deleted the network-churn branch on April 17, 2025.